Requirements of a Web Crawler's Design

Requirements

Let’s highlight the functional and non-functional requirements of a web crawler.

Functional requirements

These are the functionalities a user must be able to perform:

Crawling: The system should scour the WWW, starting from a queue of seed URLs provided initially by the system administrator.
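As a rough illustration of crawling outward from a seed queue, here is a minimal breadth-first sketch. It is not the design's actual crawler: the `TOY_WEB` link graph, the `crawl` function, and the `fetch_links` callback are hypothetical names introduced here, and the "web" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for the web (assumption).
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed queue."""
    unique_seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(unique_seeds)                 # URLs waiting to be visited
    seen = set(unique_seeds)                       # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:                   # never enqueue a URL twice
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

In a real system, `fetch_links` would download the page and extract its outgoing links; the queue-plus-seen-set structure is the part this sketch is meant to show.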

Points to Ponder

How do we select seed URLs for crawling?

There are multiple approaches to selecting seed URLs. Some of them are:

Location-based: We can have different seed URLs depending on the location of the crawler.
Category-based: Depending on the type of content we need to crawl, we can have various sets of seed URLs.
Popularity-based: This is the most widely used approach. It combines the two approaches above by grouping the seed URLs according to hot topics in a specific area.

Non-functional requirements

Scalability: The system should inherently be distributed and multithreaded, because it has to fetch hundreds of millions of web documents.
Extensibility: Currently, our design supports the HTTP(S) communication protocol and storage of text files. For augmented functionality, it should also be extensible to other network communication protocols, and it should allow new modules to be added for processing and storing various file formats.
Consistency: Since our system involves multiple crawling workers, having data consistency among all of them is necessary.
Performance: The system should be smart enough to limit how much it crawls from any single domain, either by the time spent on that domain or by the number of its URLs visited. This process is called self-throttling. The number of URLs crawled per second and the throughput of the crawled content should be optimal.
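Self-throttling can be sketched as a per-domain budget that a worker checks before fetching a URL. The following is a minimal sketch, not part of the original design: the `DomainBudget` class, its method names, and the limit values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class DomainBudget:
    """Tracks per-domain URL counts and time spent (hypothetical helper)."""

    def __init__(self, max_urls, max_seconds):
        self.max_urls = max_urls          # cap on visited URLs per domain
        self.max_seconds = max_seconds    # cap on time spent per domain
        self.url_count = defaultdict(int)
        self.time_spent = defaultdict(float)

    def allows(self, url):
        """Return True while the URL's domain is under both limits."""
        domain = urlsplit(url).netloc
        return (self.url_count[domain] < self.max_urls
                and self.time_spent[domain] < self.max_seconds)

    def record(self, url, seconds):
        """Account for one completed fetch against the URL's domain."""
        domain = urlsplit(url).netloc
        self.url_count[domain] += 1
        self.time_spent[domain] += seconds


budget = DomainBudget(max_urls=2, max_seconds=1.0)
budget.record("https://a.example/x", 0.06)
budget.record("https://a.example/y", 0.06)
print(budget.allows("https://a.example/z"))  # a.example has used its URL quota
print(budget.allows("https://b.example/"))   # b.example is untouched
```

A worker would call `allows` before each fetch and `record` after it, skipping or deferring URLs whose domain has exhausted either limit.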

Resource estimation

We need to estimate various resource requirements for our design.

Assumptions

These are the assumptions we’ll use when estimating our resource requirements:

There are a total of 5 billion web pages.
The text content per webpage is 2070 KB.
The metadata for one web page is 500 Bytes.

Storage estimation

The total storage required for the text content and metadata of 5 billion web pages is: $Total\ storage\ per\ crawl = 5\ Billion \times (2070\ KB + 500\ B) \approx 10.35\ PB$
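The storage figure can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes); the constant names are introduced here for illustration only.

```python
PAGES = 5_000_000_000          # total web pages (assumption from the text)
TEXT_BYTES = 2070 * 1000       # 2070 KB of text per page, decimal units
META_BYTES = 500               # metadata per page

total_bytes = PAGES * (TEXT_BYTES + META_BYTES)
total_pb = total_bytes / 10**15

print(f"{total_pb:.2f} PB")    # metadata adds only about 2.5 TB to the total
```

The metadata term contributes about 0.0025 PB, which is why the result is dominated by the text content.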

Traversal time

Since the traversal time is just as important as the storage requirements, let’s calculate the approximate time for one complete crawl. Assuming that the average HTTP traversal time per webpage is 60 ms, the time to traverse all 5 billion pages will be:

$Total\ traversal\ time = 5\ Billion \times 60\ ms = 0.3\ Billion\ seconds \approx 9.5\ years$

It’ll take approximately 9.5 years to traverse the whole Internet using a single crawler instance, but we want to achieve our goal in one day. We can accomplish this by designing our system to support a multi-worker architecture that divides the tasks among multiple workers running on different servers.

Number of servers estimation for multi-worker architecture

Let’s calculate the number of servers required to finish crawling in one day. Assume that there is only one worker per server.

$No.\ of\ days\ required\ by\ 1\ server\ to\ complete\ the\ task = 9.5\ years \times 365\ days/year \approx 3468\ days$

One server takes 3,468 days to complete the task.

How many servers would we need to complete this same task in one day?

We would need 3,468 servers to complete the same task in just one day.
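The traversal time and server count can be reproduced as follows. This is a sketch under the stated assumptions (60 ms per page, one worker per server, 365-day years); the variable names are illustrative. Note that the figure of 3,468 comes from rounding the traversal time to 9.5 years before converting to days; carrying the unrounded value through gives a slightly higher count.

```python
import math

PAGES = 5_000_000_000
MS_PER_PAGE = 60
SECONDS_PER_DAY = 86_400

total_seconds = PAGES * MS_PER_PAGE // 1000          # 300,000,000 s
days_one_server = total_seconds / SECONDS_PER_DAY    # about 3472.2 days
years_one_server = days_one_server / 365             # about 9.51 years

servers_rounded = math.ceil(9.5 * 365)               # rounds years first: 3468
servers_exact = math.ceil(days_one_server)           # unrounded: 3473

print(f"{years_one_server:.2f} years on one server")
print(servers_rounded, servers_exact)
```

The difference between the two counts is well under 1%, so it doesn't change the conclusion that a one-day crawl needs roughly 3,500 single-worker servers.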

Bandwidth estimation

Since we want to process 10.35 PB of data per day, the total bandwidth required would be:

$\frac{10.35\ PB}{86400\ sec} \approx 120\ GB/sec \approx 960\ Gb/sec$

$960\ Gb/sec$ is the total required bandwidth. Now, assume that the work is distributed equally among $3468\ servers$ so that it finishes in one day. Thus, the per-server bandwidth would be:

$\frac{960\ Gb/sec}{3468\ servers} \approx 277\ Mb/sec\ per\ server$
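The bandwidth numbers can be verified the same way. This sketch assumes decimal units and the 3,468-server figure from the previous section; the names are illustrative. Without intermediate rounding, the total comes to about 958.33 Gb/sec, which the text rounds to 960 Gb/sec.

```python
TOTAL_BYTES = 10.35e15        # 10.35 PB per crawl, decimal units
SECONDS_PER_DAY = 86_400
SERVERS = 3468                # server count from the estimate above

total_gbps = TOTAL_BYTES * 8 / SECONDS_PER_DAY / 1e9   # gigabits per second
per_server_mbps = total_gbps * 1000 / SERVERS          # megabits per second

print(f"total: {total_gbps:.2f} Gb/sec")
print(f"per server: {per_server_mbps:.0f} Mb/sec")
```

Starting from the unrounded total gives roughly 276 Mb/sec per server, in line with the approximate 277 Mb/sec obtained from the rounded 960 Gb/sec.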

Parameter	Value	Unit
Number of Webpages	5	Billion
Text Content per Webpage	2070	KB
Metadata per Webpage	500	Bytes
Total Storage	10.35	PB
Total Traversal Time on One Server	9.5	Years
Servers Required to Perform Traversal in One Day	3468	Servers
Bandwidth Estimate	958.33	Gb/sec